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Abstract 

Background: Transcriptome sequencing is a powerful tool for measuring gene expression, but as well as some 
other technologies, various artifacts and biases affect the quantification. In order to correct some of them, several 
normalization approaches have emerged, differing both in the statistical strategy employed and in the type of 
corrected biases. However, there is no clear standard normalization method. 

Results: We present a novel methodology to normalize RNA-Seq data, taking into account transcript size, GC 
content, and sequencing depth, which are the major quantification-related biases. In this study, we found that 
transcripts shorter than 600 bp have an underestimated expression level, while longer transcripts are even more 
overestimated that they are long. Second, it was well known that the higher the GC content (>50%), the more the 
transcripts are underestimated. Third, we demonstrated that the sequencing depth impacts the size bias and 
proposed a correction allowing the comparison of expression levels among many samples. The efficiency of our 
approach was then tested by comparing the correlation between normalized RNA-Seq data and qRT-PCR 
expression measurements. All the steps are automated in a program written in Perl and available on request. 

Conclusions: The methodology presented in this article identifies and corrects different biases that influence 
RNA-Seq quantification, and provides more accurate estimations of gene expression levels. This method can be 
applied to compare expression quantifications from many samples, but preferentially from the same tissue. In order 
to compare samples from different tissue, a calibration using several reference genes will be required. 
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Background 

The study of transcriptome has pushed forward by the 
development of next-generation sequencing technologies. 
RNA-Seq offers the possibility to get information on 
sequence and quantification of all transcribed genes, 
but extremely lowly expressed ones [1]. As shown by 
these authors, this method differs from the microarrays 
which have limitations due to (i) the difficulty to design 
specific probes, leading to artifacts caused by cross- 
hybridization and (ii) the impossibility to detect expres- 
sion for non-annotated genes. Expression quantification 
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performed using qRT-PCR is more precise than microar- 
rays, but is also not able to measure unknown genes. 
Moreover, the cost of TLDA (TaqMan Low Density Array - 
Applied Biosystems) for example, renders it unsuitable for 
large gene sets. 

The RNA-Seq protocol is a succession of technical 
steps followed by quantification. According to Illumina 
technology, (i) a cDNA library from a given tissue is ran- 
domly fragmented by sonication, (ii) specific adapters are 
ligated for the assignation of each fragment to the corre- 
sponding sample, (Hi) PCR amplification are performed, 
and (iv) amplified mRNA fragments with sizes ranging 
from 250 to 450 bp are isolated before being sequenced. 
The quantification of the sequenced fragments (called 
reads) begins with the mapping of each read onto the as- 
sembled genome or transcriptome, in order to count the 
number of reads assigned to each known or unknown 



© 2014 Filloux et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative 
Commons Attribution License (http://creativecommons.Org/licenses/by/2.0), which permits unrestricted use, distribution, and 
reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain 
Dedication waiver (http://creativecommons.Org/publicdomain/zero/1.0/) applies to the data made available in this article, 
unless otherwise stated. 



Filloux et al. BMC Bioinformotics 2014, 15:188 
http://www.biomedcentral.com/1471 -21 05/1 5/1 88 



Page 2 of 1 1 



gene. When there are several transcripts or close paralo- 
gies for a gene, the attribution of a read to the right tran- 
script is not always possible depending on the read 
position: 5'-end fragments are expected to be more spe- 
cific than 3'-end ones. The second step of quantification 
consists in removing four biases affecting read counts: 
(i) the number of reads increases with the size of the 
transcript [2-6], (ii) with the amount of the cDNA li- 
brary [7,8], (Hi) sequencing efficiency decreases when 
the GC-content is too low or too high [9-12], and (iv) 
due to a PCR amplification step during the library prep- 
aration, PCR duplicates occur when two copies of the 
same cDNA fragment produce different clusters on the 
flow cell [13-15]. 

Since RNA-Seq emergence, a number of normalization 
methods have been developed to address one or two of 
the different biases [1-12,14]. Our aim was to develop an 
integrated method able to correct all these sources of 
bias. In order to avoid RNA-Seq quantification problems 
linked to specific isoforms, unlike most studies, we only 
retained genes with a single transcript to determine the 
various equations and to perform the comparison [16]. 
As for size effects, most of them are based on mathem- 
atical distribution models to compare expression levels 
between samples, but do not consider separately the op- 
posite biases relative to size: short transcripts (<600 bp) 
are underestimated while longer ones are overestimated. 
As for the bias linked to GC content, we performed sim- 
ple regression methods based on polynomial model. It 
appeared that sequencing depth has an effect on the 
equations driving the size and GC content corrections. 
Hence, unlike other methods, a further run of our pro- 
gram was performed to correct globally the read counts 
by taking into account size, GC content and total read 
numbers. In order to assess the efficiency of our approach, 
we calculated the correlation between corrected RNA-Seq 
counts and qRT-PCR quantifications. 

Methods 

RNA extraction 

Longissimus thoracis muscle biopsies were taken between 
the 7 th and 9 th ribs of 125 limousine bulls slaughtered at 
the age of 15.8 months. The samples were immediately 
frozen in liquid nitrogen and stored at -80°C. After grind- 
ing tissues using a FastPrep FP120 Homogenizer device 
(Thermo Savant) and micro-tubes "Lysing Matrix D" (MP 
Biomedicals), RNA extractions were performed with the 
RNeasy Midi/Maxi kit (Qiagen). The procedure and solu- 
tion quantity were optimized for extraction from skeletal 
muscles and treatment with proteinase K as recom- 
mended by the supplier. The quality control of RNA step 
was done using RNA 6000 Nano Chips analyzed with 
2100 Bioanalyzer instrument (Agilent Technologies). The 
22 best ranking RNA samples were retained. 



RNA sequencing 

To verify the absence of degradation during the storage 
period, the quality of these 22 cattle samples were then 
checked again before preparing cDNA libraries according 
to the Illumina protocol. Briefly, mRNAs were isolated 
from total RNA by their polyA tails and cDNA libraries 
were built using random-hexamers. These cDNAs were 
fragmented by sonication, and specific adapters were then 
ligated to each fragment for the traceability of the sample. 
Ten cycles of PCR amplification were performed. Amp- 
lified mRNA with a size between 250 and 450 bp were 
then isolated before being sequenced in paired-end 
reads with a length of 100 bases using Illumina High- 
Seq2000 device (hosted at the INRA Genomic Platform 
of Toulouse, France). 

RNA-Seq read counting 

The first step consists in de-multiplexing the reads by 
recognizing specific adapter sequences to assign each 
read to the corresponding sample (three samples were 
pooled per flow cell lane). From 100 to 240 million paired- 
end reads were obtained per flow cell lane, corresponding 
to 27 to 91 million reads for each cDNA library. These 
paired-end reads were then mapped back to the bo- 
vine reference transcriptome, using Bos taurus known 
transcripts recorded in the Ensembl database v.61 
(Website: ftp://ftp.ensembl.org/pub/release-75/fasta/bos_ 
taurus/cdna/). This set contains 27,663 transcript 
sequences assigned to 21,734 known genes and pseudo- 
genes. Paired-end reads located exactly on the same tran- 
script were selected and counted. A total of 21,455 
transcripts (17,605 genes) were identified, with at least 
one paired-end read within the 22 analyzed samples. 

qRT-PCR quantification 

Among the 22 cattle samples, five of them were chosen to 
perform qRT-PCRs on the basis of a large range of total 
read numbers. These samples showed around 10. 10 6 
(1475), 13.10 6 (1455), 20.10 6 (1479), 24.10 6 (1345), and 
30. 10 6 reads (1476), respectively. These experiments were 
conducted using custom-made TLDA (Taq-Man Low 
Density Array) cards and ABI PRISM 7900HT sequence 
detector system (Applied Biosy stems). The dataset was 
built with genes involved in glycosylation metabolism, 
named glyco-genes in the following. They concern 
glycosyl-transferases, glycosidases, sulfo- transferases, sugar 
carriers, and lectines. Among the around 800 genes re- 
corded in the bovine genome (unpublished data), 372 
were selected according to two criteria: the greatest di- 
versity of the glyco-gene groups and the availability of 
primers provided by Applied Biosystems (https://bioinfo. 
appliedbiosystems.com/genome-database/gene-expression. 
html). Twelve housekeeping genes (18S RNA, TFIID tran- 
scription factor, etc., see [17]) were added as controls to 
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complete the 384-microwells of each microfluidic card. 
The quantification was done using the SDS 2.3 software 
(Applied Biosystems) according to the ACt method (see 
the User Bulletin #2 for ABI PRISM 7700 of October 
2001). Briefly, ACt corresponds to the threshold cycle (Ct) 
for each gene minus that of the mean of the twelve 
endogenous internal controls. 

RNA-Seq data from public datasets (drosophila and 
human) 

To validate the first steps of our method, it was necessary 
to consider public data dealing with other organisms than 
Bos taurus. 

As for Drosophila, in the public dataset SRA: 
SRP009459/GEO: GSE33905 deposited by B.R. Graveley 
and co-workers, we downloaded 16 read sequence sets 
obtained from head of male and female adults 
(GSM838758 to GSM838760, GSM838763 to GSM838766 
to GSM838780, and GSM838799 to GSM838802). The se- 
quencing depth varied from 2.7 to 8.4 million reads, with a 
mean value close to 5 million reads. 

As for Human, we considered the dataset SRA: 
SRP032775/GEO: GSE52166 deposited by R. Sanka and 
co-workers. In order to have homogeneous data, the 
read sequence sets came from whole blood of 20 indi- 
viduals in a pre-infection state relatively to Plasmodium 
falciparum (GSM1335718, GSM1335720, GSM1335722 
to GSM1335756). The sequencing depth varied from 
21.2 to 72.9 million reads, with a mean value around 40 
million reads. 

Using STAR aligner software v.2.3.1f [18], the read 
sequences were splice-aligned onto the Drosophila v. 
BDGP5.75 or the Human genome v.GRCh37.75, respect- 
ively. Transcripts were quantified with sigcufflinks (avail- 
able upon request at www.sigenae.org), a modified version 
of the cufflinks code [19] providing raw read counts per 
transcript, by using the GTF reference files provided by 
Ensembl (version 75). 

Simulation of RNA-Seq data 

As we suspected that the RNA fragment sizes have an 
impact on the behavior of read counts as a function of 
transcript sizes, it was useful to conduct simulation 
using a specific program. We first downloaded the tran- 
script genes of Bos taurus chromosome 20 from Ensembl 
(version 75 - genome assembly UMD3.1). All the se- 
quences were concatenated to obtain a single sequence of 
255,601 bp. This sequence was then split into 231 
genes in the FASTA format, with increasing sizes from 
50 bp to 1,200 bp according to an arithmetic progres- 
sion with common difference of 5. This file was sub- 
mitted to rlsim [20]. Default parameters were chosen, 
except for sequenced fragment range, and total read 
number (1 million). Three runs were launched, the first 



with 250-450 (mean 350) bp, the second with 450-650 
(mean 550) bp, and the last 650-850 (mean 750) bp. 
For each transcript, the program assigns an expression 
level from a mixture of gamma distribution with two 
components with mean 5,000 and 10,000. Then, the 
simulation provides for each read its sequence and the 
assigned gene. We then calculated the number of reads 
for each gene using the program Fishing-net, written in 
Perl, available upon request from CF and DP. 

Results 

As qRT-PCR quantification were used to validate our 
RNA-Seq normalization method, it was necessary to verify 
that qRT-PCR data were not subject to transcript size and 
GC content biases. As for transcript size, we tested a rela- 
tionship with the ACt obtained by qRT-PCR for cattle 
sample 1479 (n = 233). Through polynomial equations of 
first and third orders, the ^-values were 0.84 and 0.87, re- 
spectively. As for GC content in the same sample, the cor- 
responding polynomial equations gave ^-values of 0.57 
and 0.96, respectively. We verified that for the four other 
cattle samples, no significant relationships were observed 
neither for transcript size nor for GC content (Additional 
file 1). 

To compare qRT-PCR results with the RNA-Seq 
approach, several steps of correction are needed. The 
calculations concerned 14,676 genes for which only one 
transcript were detected in Cattle. We propose an inte- 
grated method called SGTR (transcript Size, GC content 
and Total Read number) that takes into account the ef- 
fects of transcript size, GC content, and total read num- 
ber. First, it was necessary to apply a log 2 transformation 
to raw counts to avoid large dispersion for high values 
according to [2,10,21] and [22]. 

Correction of transcript length biases 

We sorted the transcripts according to their size and 
built length classes: the class n contains all the transcripts 
for which the size is comprised in the [ n ; n + 99 ] inter- 
val. As for example, the cattle sample 1479 resulted in 
the Figure 1A. It is clear that two parts can be observed 
on both sides of the size 600. The regression equations 
for transcripts < 600 bp and > 600 bp are respectively as 
follows: 

yi = axJCi + bi (1) 
and 

y i = a 2 .x i + b 2 (2) 

where y t corresponds to the mean read number for the 
size class x t . 

We observed that the slope ax for shorter transcripts 
was higher than the one « 2 for longer transcripts, and 
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Figure 1 Method implemented to correct the biases linked to transcript size. Size classes were built every 100 bp for transcripts < 5000 bp, 
as too few transcript numbers were observed with a size > 5000 bp leading to scattered dots. The dotted line separates the red regression line 
corresponding to transcripts < 600 bp and the blue one to transcripts > 600 bp. The vertical axis corresponds to log 2 transformed values of read 
numbers. A) Cattle sample 1479. B) Drosophila sample SRR384925 (7132 genes with a single transcript). C) Human sample SRR1 177729 (16,228 
genes with a single transcript). D) Representation of sample 1479 after correction (green). E) RNA-Seq simulated data using rlsim, where Loess 
smoothing was applied to each series. The blue points correspond to a run where the sequenced fragments are in the range of 250-450 bp, the 
red one to the range of 450-650 bp, and the violet one to the range of 650-850 bp. 



verified that this trend was also true for the 21 other cat- 
tle samples analyzed by RNA-Seq. In particular, the 
600 bp border remained constant In other species (e.g. 
Drosophila and Human), we retrieved this 600 bp border 
in all the samples tested (16 drosophila head samples 
and 20 human whole blood samples). One example of 
each of this species is presented in Figures IB and 1C. 
To correct the bias linked to transcript sizes, it was ne- 
cessary to introduce two different equations correspond- 
ing to each part of the graph. As size 600 is a critical 
value, we decided to adjust all the read numbers to this 
size. First consider the left part; for a transcript of size S, 
we added the value " aj (600 - S) " to the observed read 
number. Likewise, for transcripts > 600 bp, we removed 
the value " a 2 (S - 600) ". As a result, the read numbers 
of all the transcripts were adjusted to the size 600 
(Figure ID). 

To understand the significance of this 600 bp border, 
we hypothesized that it could be due to the length of the 
sequenced fragments. This idea was tested using the 



simulation procedure implemented in rlsim software. 
Three different fragment lengths were considered: 250- 
450 (mean 350) bp, 450-650 (mean 550) bp, and 650- 
850 (mean 750) bp, with a fixed total read number of 1 
million. The results are summarized in Figure IE, with 
LOESS smoothing. It is difficult to give a precise position 
of the break point between the two regression lines, but it 
is clear that the greater the sequenced fragments, the more 
the break point is shifted toward the right. Moreover, the 
slopes for the regression lines situated before the break 
points seem to be similar. 

To assess the efficiency of our method, we calculated 
the Pearson correlations between qRT-PCR and RNA-Seq 
counts corrected by FPKM (Fragments Per Kilobase per 
Million mapped reads) [23] or SGTR for the five bovine 
samples. We choose FPKM as it is a one of the most fre- 
quently method used for normalization. Briefly, it consists 
in dividing the fragment counts by transcript size and the 
total number of reads, and adjusted to 1 kb and 1 million 
reads. 
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Among all the genes detected by qRT-PCR and RNA- 
Seq methods, we considered five sub-samples according 
to the class size of transcripts. The results shown in Table 1 
indicate that the correction by FPKM is improved by 
transforming the raw values by their logarithms. Whatever 
the samples, the ^-values observed for FPKM were largely 
worse than the one corresponding to log 2 (Tl?YM), except 
for transcripts < 1,000 bp and for transcripts > 4,000 bp of 
the sample 1475 and 1455 which presented the lowest 
sequencing depth. This resulted from the distribution of 
values illustrated in Figure 2A and 2B. Consequently, fur- 
ther comparisons will only be made on the /og^FPKM) 
values. When we compare SGTR correction according to 



size only with the previous normalization, the ^-values 
were generally of the same order of magnitude. Never- 
theless, we observed slightly better results with our 
method for transcripts < 1,000 bp but faintly worse re- 
sults for transcripts between 1,000 and 2,000 bp, whatever 
the sample. 

Removing of the GC-content effect 

For the gene dataset of each cattle sample, we first cal- 
culated the trend curve for read numbers according to 
GC content. Polynomial equations of different order were 
tested and revealed dissymmetric dome shaped curves: the 
left increasing part (GC from 35 to 40%) was hardly visible 



Table 1 Comparison between FPKM and SGTR methods according to transcript size 
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N corresponds to the number of analyzed genes. The five samples (1475, 1455, 1479, 1345, and 1476) refer respectively to samples with a total read number 
around 10.10 6 , 13.10 6 , 20.1 0 6 , 24.1 0 6 , and 30.1 0 6 reads. Abbreviations: SGTR size: correction for transcript size; and SGTR Size and GC content: correction for 
transcript size and GC content. Only the p-values of Pearson correlation with qRT-PCR quantifications are indicated, p-values were calculated using the Past3 
program [24]. 
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Figure 2 Relationships between RNA-Seq normalization methods and qRT-PCR quantifications (Cattle sample 1479). A) FPKM corrected 
values. B) /og 2 (FPKM) corrected values. C) SGTR corrected values including size and GC content bias correction. 



by comparison to the right decreasing one (GC from 45 
to 80%), where decreasing trend was getting more and 
more pronounced for GC > 50% (data not shown). We 
retained a third order polynomial function that clearly 
showed this last trend in all the samples, giving the 
Equation 3 (Figure 3A). Below the 50% threshold, the 
mean read numbers remained fairly constant. 



y = c.x 3 d.x 2 + e.x +/, 



(3) 



where y represents the size-corrected mean read number 
and x the GC content. 

Second, for each GC percentage, we calculated the 
difference between the GC content and 50% that we 
applied to the previous polynomial equation leading to 
the Figure 3B. The best fitting polynomial function was 
then deduced: 



J — g- x + h.x + i.x + 



(4) 



where y corresponds to the predicted read number and 
x represents the difference between a GC content and 
50%. 

Third, we adjusted the size-corrected values by remov- 
ing "g.x 3 + h *X ~\~ l*X " of this last function to all the tran- 
scripts. The Figure 3C illustrates the efficiency of GC 
bias correction. 

For the 20 human samples, we obtained the same pro- 
files of size-corrected read number according to GC con- 
tent as in Cattle (data not shown). In contrast, for the 16 
drosophila samples, the polynomial curves were different 
(Figure 3D and 3E). Nevertheless, the correction of the 
GC content bias using the previous procedure yielded a 
smoothing curve absolutely flat (Figure 3F), attesting the 
efficiency of our method. 

The final step consisted in testing the effect of the GC 
content bias removing on the correlation between RNA- 
Seq counting and qRT-PCR quantification, in the case of 



bovine data. Except for the sample 1475 (10.10 6 reads), 
this last bias correction improved the global correlation 
with qRT-PCR quantifications relatively to the simple 
size correction by SGTR (Table 1). By comparison with 
/og2(FPKM) correction, the removing of size and GC con- 
tent biases improved the global correlation with qRT-PCR 
results, except for the sample 1455 which presented a low 
sequencing depth (13.10 6 reads) and showed similar re- 
sults as log 2 (¥PKM) correction. The Figure 2C illustrates 
the correlation obtained between SGTR including size 
and GC content corrections and qRT-PCR quantifica- 
tions for the sample 1479. We observed a better propor- 
tionality than the one provided by /og^FPKM) correction 
(Figure 2B). 

Adjustment according the total read number 

For the 22 cattle samples, we calculated the correlations 
between total read numbers and the regression parame- 
ters (tff, bj, a 2 , and b 2 ) as given in Figure 1. Except the 
coefficient b 1} all the coefficients were positively corre- 
lated with total read numbers (Figure 4 A, 4B, and 4C). 
The new regression parameters were defined as u t and 
Vi for the slope and constant respectively: 



y t = u x .Xi + Vi, 
y t = u 2 .Xi + v 2 , 



and 



y t = u 3 .Xi + v 3 



(5) 
(6) 

(7) 



where y t corresponds to the regression parameter for a 
total read number x t . 

As 20 million was close to the mean value of total read 
numbers among the 22 samples, we decided to adjust all 
TRN (Total Read Numbers) to 20 million. For transcripts 
<600 bp and TRN, we corrected the /og^transformed 
read numbers by adding the value " u ± (20.10 6 - TRN ) " 
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GC content GC content 



Figure 3 Method implemented to correct GC content biases. Variations in size-corrected mean read numbers according to GC content. The 
polynomial equations are indicated above (A: Cattle sample 1479, and D: Drosophila sample SRR384925). Application of the previous equation 
(Eq.3) to differences between 50% GC content and each GC content value, giving the equation indicated above (Eq.4) (B: Cattle sample 1479, and 
E: Drosophila sample SRR384925). Effect of GC content bias correction on the whole dataset. Clearly, no remaining dependence can be observed: 
the p-value to third order polynomial equation is 1.00 (C: Cattle sample 1479, and F: Drosophila sample SRR384925). 



to the parameter a± in the Equation 1. Consequently, 
for a transcript of size S, we corrected the values with 
the following equation: 

Ji = ^°&2 ( rea( l numbers) 

+ [ai + ui (20.10 6 -77W)] (600-5) (8) 

where y t corresponds to the corrected read number for 
the transcript x t . 

Likewise, for transcripts > 600 bp, we added the value " 
u 2 (20. 10 6 - TRN) " to the parameter a 2 in the Equation 2, 
and adjusted the corrected value by adding " [u 3 ( 20. 10 6 - 
TRN)] ". As a result, the /^-transformed read numbers 
were corrected with the subsequent equation: 

Ji = /o^ 2 (read numbers)- [a 2 + u 2 (20. 10 6 -TRN)] 
x (5-600) + u 3 (20.10 6 -TRN) 

(9) 

where y t represents the corrected read number for the 
transcript x t . 



On the other hand, after calculating the Eq.4' correspond- 
ing to the Eq.4 based on the size- and TRN-corrected 
values, we determined the correlations between TRN and 
the regression parameters for GC content (defined as g' hi 
i\ and f, as in Figure 3B). It appears that none of these pa- 
rameters were linked to sequencing depth. Consequently, 
we corrected the GC content bias by removing " g\x 3 + h\ 
x 2 + V.x " to the size- and TRN-corrected values, giving 
the following equation: 

Ji = /og2( size -and TRN-corrected values) 

-(g'-x 3 + h\x 2 + i'.x), 

(10) 

where y t corresponds to the full-corrected read num- 
ber, and x to the difference between the GC content 
and 50%. 

Lastly, the negative final values were considered as null. 
It should be noted that when we applied the correction 
due to TRN, the correlations between SGTR and qRT- 
PCR quantifications became slightly better comparatively 
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Figure 4 Correlation between regression parameters and total read numbers (Cattle sample 1479). A) Slope a 1 for transcripts < ( 
B) Slope a 2 for transcripts > 600 bp. C) Constant b 2 for transcripts > 600 bp. The equations are indicated below the regression lines. 



i bp. 



to the previous SGTR steps (Size and GC content cor- 
rections), except for the samples 1475 and 1476 which 
present the lowest and the highest sequencing depth 
(Table 2). In summary, the full SGTR correction showed 
better results than /og^FPKM), except for the sample 
1475. 

Discussion 

Our results showed that non-transformed counts values 
from RNA-Seq presented worse correlations with qRT- 



PCR quantification than the /o^ 2 -transformed ones, as 
already stressed by [2,10,21], and [22]. The prior trans- 
formation of read counts by log 2 function was motivated 
by the variability of data corresponding to highly expressed 
genes, often observed in large size transcripts. We hypoth- 
esized that this transformation could also attenuate the 
overestimations due to PCR duplicates. Indeed, the more 
expressed the transcripts, the higher the probability to 
generate duplicates (several clusters of reads share exactly 
the same start and end) [13,15]. Otherwise, certain authors 



Table 2 Correction of the impact of total read numbers 


N 


Samples 


log2(FPKM) 


SGTR - Size 


SGTR - Size and GC content 


Full SGTR 


159 


1475 


7,96E-39 


1.71 E-39 


4.80E-39 


1 .08E-38 


155 


1455 


2,95 E-39 


5.86E-39 


3.40E-39 


2.66E-39 


All genes 162 


1479 


8,24E-44 


1 .37E-43 


2.02E-49 


1.21 E-49 


152 


1345 


1,57E-42 


1 .86E-42 


6.11E-44 


5.64E-44 


162 


1476 


1,73E-41 


6.74E-41 


1.51 E-44 


2.28E-44 



N corresponds to the number of analyzed genes. The five samples (1475, 1455, 1479, 1345, and 1476) refer respectively to samples with a total read number 
around 10.10 6 , 13.10 6 , 20.1 0 6 , 24.1 0 6 , and 30.1 0 6 reads. Abbreviations: SGTR size: correction for transcript size; SGTR Size and GC content: correction for transcript 
size and GC content; and Full SGTR: correction for transcripts size, total read number, and GC content. Only the p-values of Pearson correlation with qRT-PCR 
quantifications are indicated. 



Filloux et al. BMC Bioinformotics 2014, 15:188 
http://www.biomedcentral.com/1471 -21 05/1 5/1 88 



Page 9 of 1 1 



have proposed to apply /(^-transformed values to the data 
extracted from qRT-PCR [25,26]. Given our regression 
curves, it is clear that for our samples, this correction is 
inappropriate (unpublished data). 

As for transcript size correction, two strategies have 
been adopted by different authors. In the first one, the 
transcripts are ranked in quantiles containing identical 
numbers [2,6,7]. The advantage is a balanced distribu- 
tion facilitating further statistical analysis. However, it is 
difficult to assign a mean read number to scaled sizes. In 
the second one, size classes are built irrespective of the 
number of genes per class [4], leading to an increasing 
dispersion for the classes of higher sizes (mainly due to 
lower number of genes). Both approaches allowed avoid- 
ing certain limitations implemented in RPKM (Reads Per 
Kilobase of exon model per Million mapped reads) [1] or 
FPKM [23] methods, where the number of read is simply 
divided by transcript size. The main difference consists 
in taking into consideration paired-reads in the FPKM 
method while only simple reads in the RPKM one. 

We choose the second strategy because of the excellent 
regression quality of mean read numbers by size classes. 
We interpret the border 600 bp observed whatever the 
species dataset (Figure 1A-1C) as a result of sonication 
and selection of cDNA fragments between 250 and 
450 bp. Indeed, fragments > 600 bp are all the more so 
represented that they are long [1,3,4,27]. Conversely, 
the fragments < 600 bp are under-represented as many 
small segments were not sequenced. Moreover, the 
simulation conducted with rlsim confirmed our view, 
and showed that the border increases with the size of 
the sequenced fragments (Figure IE). Hence, this proves 
the effect of the cDNA fragments size selection on the 
break point between the two regression lines. As a re- 
sult, independent corrections are needed for both tran- 
script sizes. This last point provided slightly better 
correction than the /qg^FPKM) for transcripts < 1,000 bp 
(see Table 1). According to [14,28] and [29], RNA-Seq 
protocol including PCR in the first steps introduced biases 
linked to GC content, as cDNA fragments with high GC 
and AT content are under-sequenced. To correct this bias, 
[10,14] and [30] proposed to build GC-classes. In our 
method, we took into account the general trend by calcu- 
lating a three order polynomial equation, which was used 
to correct the decrease over 50% GC content. The effi- 
ciency of our correction was sample-dependent and more 
precisely linked to sequencing depth. Indeed, for a low 
number of reads, the GC bias correction did not improve 
the normalization, in contrast to samples with higher 
sequencing depth. SGTR including Size and GC content 
corrections provide thus globally better results than log 2 
(FPKM) (Table 2), which is in agreement with the con- 
clusions of [8] and [10]. We expect that the GC content 
correction should be more accurate if it was applied on 



gene segments (-300 to 500 bp) and not on full length 
transcripts, as there are variations along the sequence in 
their GC content. 

Lastly, since the sequencing depth introduces effects on 
transcript size bias, we adjusted the TRN to 20 million 
reads in reason of its medium value. Hence, we modified 
the parameters a 1} a 2 , and b 2 , but this step requires 
numerous samples to obtain reliable values. Finally, 
these size and TRN adjusted values were then corrected 
for GC content bias. 

Our integrated method corrects some biases linked to 
transcript size and GC content, but also sequencing depth. 
However, it is striking that for the lowest sequencing 
depths (sample 1475: 10 million reads; 1455: 13 million 
reads) our correction gave worse or equal correlations 
with qRT-PCR values than /qg2(FPKM). In contrast, for 
read counts over 20 million, our method significantly 
improves the read counting, for the whole dataset and for 
most gene size classes. The question is to interpret this 
observation and several considerations have to be taken 
into account. First, in our samples, when the total number 
of reads is low, it is particularly true for transcript with 
sizes shorter than 600 pb, the regression equation between 
transcript size and read counts is less accurate than the 
one for transcript sizes longer than 600 pb. Second, the 
more expressed the transcripts (total read numbers over 
20 million), the higher is the probability to generate dupli- 
cates and other biases induced by RNAseq. Our method 
can be compared to GAM (Generalized Additive Model) 
of [11], where the data are corrected for length, GC con- 
tent, and dinucleotide frequency biases. However, these 
authors have shown that the correction of dinucleotide 
frequency biases did not improve results. Unlike GAM 
method, our model is not additive as we showed that the 
regression coefficient linked to transcript length depend 
on the sequencing depth. That was not the case for poly- 
nomial equation coefficients used to correct the GC con- 
tent bias. Improvements are still needed to better take 
into account the variation of GC content per read in a 
given transcript, as the GC content is not homogeneous 
along the sequence. Protocols excluding PCR in first step 
could avoid this issue, and problems linked to PCR dupli- 
cates [13,15,28]. On the other hand, it is highly desirable 
to provide a good estimation of the number of reads cor- 
responding to each transcript isoform. To overcome this 
issue, we took into account genes presenting only one 
transcript. In contrast to Human [11], this choice does not 
result in a dramatic loss of information as more than 50% 
of bovine genes have a single transcript in the available 
annotation file. The accurate determination of transcript 
size suffers from biases linked to cDNA library prepar- 
ation. Indeed, it seems that random-hexamers present 
some favored and disfavored sites, so that specific regions 
are selected more easily than others leading to biases for 
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low expressed genes [1,3132]. RNA fragmentation before 
its reverse-transcription in cDNA reduces this bias leading 
to more uniform gene coverage [33]. Nevertheless, these 
technical effects associated to library preparation as well 
as some variations observed between flow cells have al- 
ways a smaller influence that the biological effect [6,9]. 
Otherwise, the fine determination of TSSs (Transcription 
Start Sites) deduced from alignment of the reads onto 
the genome (and not onto the known transcripts) could 
further improve the accuracy of transcript size. 

Conclusions 

We demonstrated that our method is robust and suitable 
to compare the read counts of genes for numerous sam- 
ples of the same tissue. All the steps described are sequen- 
tially automated within SGTR program written in Perl, 
and available upon request from RP and DP. The exten- 
sion of our method to the normalization of the read 
numbers between different tissues requires considering 
a set of reference genes as calibrators. 

Animal ethics 

All animal experimentation complied with the French 
Veterinary Authorities' rules. No ethics approval was re- 
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